Content

This notebook goes through the following plan:

  • preprocess the data for topic modeling
  • extract topics with LDA
  • visualize topic importance by neighborhood

Dataset

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
from matplotlib import pyplot as plt

listing_filename = 'listings.csv'
cols = ['id', 'name', 'neighbourhood_cleansed', 'neighborhood_overview', 'latitude', 'longitude']
df = pd.read_csv(listing_filename, usecols=cols)
In [2]:
df.count()
Out[2]:
id                        3585
name                      3585
neighborhood_overview     2170
neighbourhood_cleansed    3585
latitude                  3585
longitude                 3585
dtype: int64
In [3]:
# Among the selected columns, only neighborhood_overview has missing values.
# Since this analysis gathers overview text by neighborhood, rows without an
# overview contribute nothing, so we drop them.
df.dropna(inplace=True)
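A bare `dropna()` works here only because `neighborhood_overview` is the sole column with gaps. A minimal sketch (with made-up rows) of the more defensive `subset` form, which would keep behaving correctly even if other columns later picked up NaNs:

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the listings data (values are made up)
df = pd.DataFrame({
    "id": [1, 2, 3],
    "neighborhood_overview": ["Quiet street.", np.nan, "Near the park."],
    "neighbourhood_cleansed": ["Roslindale", "Fenway", "Back Bay"],
})

# Restrict the drop to the one column that actually has missing values,
# so stray NaNs elsewhere cannot silently shrink the dataset later
df = df.dropna(subset=["neighborhood_overview"])
print(len(df))  # 2
```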

Prepare sentences composing each neighborhood

- split to sentences
- tokenize
- lemmatize
- discard stopwords and others
In [4]:
# Import spaCy and load the language model
import spacy
nlp = spacy.load('en_core_web_sm')
# spaCy 2.x API; in spaCy >= 3 this is simply nlp.add_pipe("sentencizer", before="parser")
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer, before="parser")
doc = nlp("This is a sentence. This is another sentence.")
In [5]:
# Let's first create a function that decides which tokens of a sentence to exclude
def is_excluded(token, avoidwords=[]):
    '''
    This function decides whether to exclude 'token' from the list of lemmas
    INPUT
        token - token object in a doc
        avoidwords - additional words to exclude
    OUTPUT
        is_excluded - boolean - whether to exclude the token from the list of lemmas
    '''
    rules = [token.is_stop,
             token.is_punct,
             token.is_space,
             token.like_num,
             token.lemma_ in avoidwords
            ]
    return any(rules)
In [6]:
# Create a function that splits a neighborhood_overview text into sentences,
# discards some tokens and returns both the original and the processed sentences
def prepare_sentences(overview, avoidwords=[]):
    '''
    This function splits a neighborhood_overview text into sentences,
    discards some tokens and returns both the original and the processed sentences
    INPUT
        overview - text from neighborhood_overview
        avoidwords - specific words to exclude
    OUTPUT
        sentences_lemmas - list of sentences of the overview text after processing
        sentences - list of the original sentences of the overview text
    '''
    if pd.notnull(overview):  # or isinstance(overview, str)
        doc = nlp(overview)
        sentences = []
        sentences_lemmas = []
        for sent in doc.sents:
            lemmas = []
            for token in sent:
                if not is_excluded(token, avoidwords):
                    lemmas.append(token.lemma_)
            # join once per sentence, after all tokens have been filtered
            sent_lemmas = ' '.join(lemmas)
            if len(sent_lemmas):
                sentences_lemmas.append(sent_lemmas)
                sentences.append(sent.text)
        return sentences_lemmas, sentences
    else:
        print("warning: empty text")
        return [], []

# test
prepare_sentences(u"""Tesla is looking at buying a U.S. startup in Boston for $6 million.
                  Startups are becoming juicy minutes after it.""",
                 avoidwords=['Boston'])
Out[6]:
(['Tesla look buy U.S. startup $', 'startup juicy minute'],
 ['Tesla is looking at buying a U.S. startup in Boston for $6 million.',
  '\n                  Startups are becoming juicy minutes after it.'])
In [7]:
# Apply prepare_sentences() to the whole dataset
# Each sentence can be linked to the neighborhood it is about using its index in the list
def prepare_data(dt, avoidwords=[]):
    '''
    This function applies prepare_sentences() to a dataframe
    INPUT
        dt - dataframe including neighbourhood_cleansed and neighborhood_overview
        avoidwords - specific words to exclude
    OUTPUT
        sents_lemmas - list of sentences from the neighborhood overviews after processing
        sents - list of sentences from the neighborhood overviews
        neighborhood_sentids - dictionary neighborhood --> ids of sentences in sents or sents_lemmas to
                               link each sentence in sents (or sents_lemmas) to the neighborhood it is about
    '''
    sents = []
    sents_lemmas = []
    neighborhood_sentids = defaultdict(list)
    last_sentid = 0
    for overview, neighbourhood in zip(dt.neighborhood_overview, dt.neighbourhood_cleansed):
        sent_lemmas, sent = prepare_sentences(overview, avoidwords)
        sents_lemmas += sent_lemmas
        sents += sent
        neighborhood_sentids[neighbourhood] += list(range(last_sentid, last_sentid + len(sent)))
        last_sentid += len(sent)
    return sents_lemmas, sents, neighborhood_sentids

# test
sents_lemmas, sents, neighborhood_sentids = prepare_data(df.loc[:1, ["neighborhood_overview", 
                                                                     "neighbourhood_cleansed"]])
print("#### sents_lemmas:", sents_lemmas, sep='\n')
print("#### sents:", sents, sep='\n')
print("#### neighborhood_sentids:", neighborhood_sentids, sep='\n')
#### sents_lemmas:
['roslindale quiet convenient friendly', 'southern food try Redd Rozzie', 'Italian Delfino Sophia Grotto great', 'Birch St Bistro nice atmostphere little pricier', 'cook Fish market fresh fish daily Tony make sausage italian food wide variety delicious cheese chocolate Cheese Cellar Birch St.', 'room Roslindale diverse primarily residential neighborhood Boston', 'connect public transportation neighborhood easy access car', 'Roslindale Square nice business district supermarket', 'bank bakery etc', 'Guidebook recommendation', 'Arnold Arboretum step away']
#### sents:
['Roslindale is quiet, convenient and friendly.', " For Southern food try Redd's in Rozzie.", " Italian Delfino's or Sophia's Grotto are great.", 'Birch St Bistro has nice atmostphere--a little pricier.', "  If you are cooking the Fish Market has fresh fish daily; Tony's makes his own sausages and has Italian foods;  for  a wide variety of delicious cheeses and chocolates go to the Cheese Cellar on Birch St.", 'The room is in Roslindale, a diverse and primarily residential neighborhood of Boston.', "It's well connected via public transportation to other neighborhoods and easy to access by car.", 'Roslindale Square is a nice business district with supermarkets.', 'banks, a bakery, etc. (', 'See my Guidebook for some recommendations).', 'The Arnold Arboretum is just steps away.']
#### neighborhood_sentids:
defaultdict(<class 'list'>, {'Roslindale': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
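The sentence-id bookkeeping inside prepare_data() reduces to a running counter plus a defaultdict. In isolation, with toy sentences:

```python
from collections import defaultdict

# Toy (neighborhood, sentences) pairs; a neighborhood may appear more than once,
# just as it does across multiple listings
overviews = [("Roslindale", ["a", "b"]), ("Fenway", ["c"]), ("Roslindale", ["d"])]

neighborhood_sentids = defaultdict(list)
last_sentid = 0
for neighbourhood, sents in overviews:
    # global, contiguous sentence ids, so they index directly into the flat sentence list
    neighborhood_sentids[neighbourhood] += list(range(last_sentid, last_sentid + len(sents)))
    last_sentid += len(sents)

print(dict(neighborhood_sentids))  # {'Roslindale': [0, 1, 3], 'Fenway': [2]}
```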
In [8]:
neighborhoods = df["neighbourhood_cleansed"].unique()
X = df[["neighborhood_overview", "neighbourhood_cleansed"]]
# create sentences
# N.B. neighborhood names are excluded so that the topics do not simply mirror the neighborhoods
sents_lemmas, sents, neighborhood_sentids = prepare_data(X, 
                        avoidwords=['Boston','neighborhood','jp', 'JP'] +' '.join(neighborhoods).split())
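The `' '.join(neighborhoods).split()` idiom deserves a note: it flattens multi-word neighborhood names into individual word tokens, which is needed because is_excluded() compares one token at a time. A small sketch with two invented names:

```python
# Multi-word names must be split into single words,
# because is_excluded() matches token lemmas one at a time
neighborhoods = ["South End", "Back Bay"]
avoidwords = ['Boston'] + ' '.join(neighborhoods).split()
print(avoidwords)  # ['Boston', 'South', 'End', 'Back', 'Bay']
```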

Part 1: Topics extraction from neighborhood overviews

Using topic modeling

What are the topics in the neighborhood overviews?

We will use Latent Dirichlet Allocation (LDA) to extract them.

In [10]:
# First create a document-term matrix 
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(sents_lemmas)
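What `max_df` and `min_df` do can be sketched without scikit-learn: `min_df=2` keeps only terms appearing in at least two documents, and `max_df=0.95` drops terms appearing in more than 95% of them. A pure-Python approximation on a toy corpus:

```python
from collections import Counter

docs = ["cat sat mat", "cat ran", "dog ran"]  # toy corpus
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(w for d in docs for w in set(d.split()))

# Mimic CountVectorizer(min_df=2, max_df=0.95): keep mid-frequency terms only
vocab = sorted(w for w, c in doc_freq.items() if c >= 2 and c / n_docs <= 0.95)
print(vocab)  # ['cat', 'ran']
```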
In [11]:
from sklearn.decomposition import LatentDirichletAllocation
nbtopics = 5
LDA = LatentDirichletAllocation(n_components=nbtopics, random_state=42,
                               max_iter=80, evaluate_every=2)
# This can take a while; we're dealing with a large number of documents!
LDA.fit(dtm)
Out[11]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=2, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=80,
                          mean_change_tol=0.001, n_components=5, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
In [12]:
LDA.n_iter_, LDA.perplexity(dtm)
Out[12]:
(43, 709.9403504506598)
In [13]:
# Showing Top Words Per Topic:

feature_names = cv.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 20 WORDS FOR TOPIC #{index+1}')
    print([feature_names[i] for i in topic.argsort()[-20:]][::-1])
    print('\n')
THE TOP 20 WORDS FOR TOPIC #1
['walk', 'minute', 'downtown', 'close', 'access', 'station', 'easy', 'line', 'city', 'location', 'right', 'away', 'min', 'area', 'stop', 'locate', 'short', 'distance', 'bus', 'public']


THE TOP 20 WORDS FOR TOPIC #2
['restaurant', 'shop', 'walk', 'store', 'bar', 'great', 'street', 'cafe', 'food', 'away', 'block', 'good', 'minute', 'grocery', 'distance', 'coffee', 'italian', 'corner', 'local', 'market']


THE TOP 20 WORDS FOR TOPIC #3
['quiet', 'city', 'diverse', 'area', 'safe', 'family', 'live', 'residential', 'young', 'community', 'great', 'friendly', 'professional', 'people', 'home', 'love', 'student', 'street', 'locate', 'old']


THE TOP 20 WORDS FOR TOPIC #4
['mi', 'mile', 'center', 'museum', 'new', 'hall', 'restaurant', 'university', 'house', 'boston', 'fine', 'england', 'faneuil', 'trail', 'major', 'arts', 'freedom', 'home', 'garden', 'prudential']


THE TOP 20 WORDS FOR TOPIC #5
['street', 'walk', 'park', 'historic', 'beautiful', 'charles', 'river', 'newbury', 'public', 'square', 'locate', 'copley', 'pond', 'block', 'line', 'arboretum', 'apartment', 'away', 'center', 'space']
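The `topic.argsort()[-20:]` trick in the loop above is worth unpacking: argsort is ascending, so the last k indices point at the largest weights, and the final `[::-1]` puts them in largest-first order. In miniature, with invented weights:

```python
import numpy as np

# Invented topic-word weights over a four-word vocabulary
weights = np.array([0.1, 0.5, 0.2, 0.9])
vocab = ["walk", "park", "shop", "train"]

# argsort is ascending, so take the last k indices and reverse for largest-first
top2 = [vocab[i] for i in weights.argsort()[-2:]][::-1]
print(top2)  # ['train', 'park']
```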


In [14]:
# For each neighborhood: what proportion of its overview sentences is about each topic?

topic_results = LDA.transform(dtm)

d = {**{"neighborhood": list(neighborhood_sentids.keys())}, **{f"topic {i+1}":[] for i in range(nbtopics)}}
d["sentences_count"] = []
d["dominant_topic"] = []
for nghbd, sentids in neighborhood_sentids.items():
    c = np.average(topic_results[sentids], axis=0)
    for i in range(nbtopics):
        d[f"topic {i+1}"].append(c[i])
    d["sentences_count"].append(len(sentids))
    d["dominant_topic"].append(np.argmax(c))  # N.B. 0-based: dominant_topic 0 means "topic 1"
H = pd.DataFrame(d).sort_values(by="sentences_count", ascending=False).round(3)
H.set_index("neighborhood", inplace=True)
H.style.background_gradient(axis=0, subset=list(f"topic {i+1}" for i in range(nbtopics)))
Out[14]:
topic 1 topic 2 topic 3 topic 4 topic 5 sentences_count dominant_topic
neighborhood
Jamaica Plain 0.184000 0.258000 0.257000 0.070000 0.232000 1018 1
South End 0.180000 0.278000 0.198000 0.105000 0.240000 691 1
Dorchester 0.206000 0.213000 0.313000 0.116000 0.152000 577 2
Back Bay 0.209000 0.162000 0.139000 0.098000 0.391000 509 4
Allston 0.270000 0.285000 0.236000 0.061000 0.147000 484 1
Fenway 0.279000 0.175000 0.147000 0.216000 0.183000 436 0
South Boston 0.319000 0.234000 0.182000 0.134000 0.131000 435 0
Beacon Hill 0.193000 0.256000 0.142000 0.092000 0.317000 433 4
East Boston 0.253000 0.280000 0.277000 0.067000 0.123000 347 1
Brighton 0.317000 0.291000 0.202000 0.069000 0.121000 331 0
North End 0.236000 0.220000 0.241000 0.132000 0.171000 310 2
Roxbury 0.225000 0.215000 0.366000 0.077000 0.117000 308 2
Downtown 0.230000 0.197000 0.163000 0.188000 0.222000 296 0
Roslindale 0.237000 0.227000 0.268000 0.076000 0.192000 186 2
Charlestown 0.220000 0.187000 0.211000 0.210000 0.172000 169 0
Mission Hill 0.202000 0.235000 0.277000 0.169000 0.117000 168 2
South Boston Waterfront 0.381000 0.138000 0.142000 0.168000 0.171000 125 0
Chinatown 0.421000 0.092000 0.068000 0.337000 0.082000 84 0
West Roxbury 0.258000 0.234000 0.270000 0.099000 0.139000 76 2
West End 0.187000 0.169000 0.174000 0.178000 0.293000 57 4
Bay Village 0.258000 0.177000 0.198000 0.062000 0.305000 50 4
Hyde Park 0.262000 0.179000 0.317000 0.085000 0.158000 43 2
Mattapan 0.271000 0.296000 0.197000 0.049000 0.187000 42 1
Leather District 0.579000 0.103000 0.046000 0.100000 0.172000 26 0
Longwood Medical Area 0.232000 0.164000 0.386000 0.070000 0.148000 18 2
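The aggregation behind this table reduces to averaging the rows of the sentence-topic matrix that belong to one neighborhood. With made-up numbers:

```python
import numpy as np

# Toy per-sentence topic distributions (each row sums to 1)
topic_results = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.6, 0.3],
                          [0.2, 0.5, 0.3]])
sentids = [1, 2]  # sentence ids belonging to one neighborhood

# Average the neighborhood's rows to get its overall topic profile
c = topic_results[sentids].mean(axis=0)
print(c)                # approximately [0.15, 0.55, 0.30]
print(int(c.argmax()))  # 1 -> dominant topic; 0-based, i.e. "topic 2"
```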

Part 2: Comparison between neighborhoods by topic

First, we create functions to get the top neighborhoods per topic and to visualize them.

In [42]:
def get_top_neighborhoods(H, topic, max_rank=3):
    """
    Gets top neighborhoods by the proportion of overviews dedicated to a topic
    INPUTS:
        H - dataframe - dataframe of topic proportions by neighborhood
        topic - string - topic of choice e.g. "topic 2"
        max_rank - integer - length of the desired ranking (e.g. max_rank=3 to get top 3 neighborhoods)
    OUTPUT:
        top_neighborhoods - list of the top max_rank neighborhoods
    """
    top_neighborhoods = H[topic].sort_values(ascending=False)
    return list(top_neighborhoods.index[:max_rank])

# Test
get_top_neighborhoods(H, "topic 1", max_rank=3)
Out[42]:
['Leather District', 'Chinatown', 'South Boston Waterfront']
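An equivalent and arguably more idiomatic ranking uses Series.nlargest, which avoids sorting the entire column. A sketch on a toy frame mimicking a few rows of H (proportions approximated from the table above):

```python
import pandas as pd

# A few rows mimicking H (topic 1 proportions approximated from the table above)
H_toy = pd.DataFrame(
    {"topic 1": [0.58, 0.42, 0.38, 0.19]},
    index=["Leather District", "Chinatown", "South Boston Waterfront", "West End"],
)

# nlargest selects only the top k values, instead of sorting the whole column
top3 = list(H_toy["topic 1"].nlargest(3).index)
print(top3)  # ['Leather District', 'Chinatown', 'South Boston Waterfront']
```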
In [65]:
# Visualize a specific topic importance with a map of 
# listings colored according to the topic proportion in their neighborhood
import plotly.express as px
from IPython.display import Image

def map_topic(df, H, topic):
    """
    Creates a map of listings colored according to the topic proportion in their neighborhood
    INPUTS:
        df - dataframe - dataset of listings
        H - dataframe - dataframe of topic proportions by neighborhood
        topic - string - topic of choice e.g. "topic 2"
    OUTPUT:
        fig - plotly figure
    """
    df[topic] = df["neighbourhood_cleansed"].apply(lambda n: H.loc[n, topic])
    fig = px.scatter_mapbox(df, lat="latitude", lon="longitude", color=topic, 
                            hover_name="neighbourhood_cleansed", zoom=11,
                            center={'lat':42.32,'lon':-71.08}, width=700, height=700,
                            )
    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    return fig

Topic 1: About well-connected neighborhoods and proximity to downtown

In [67]:
max_rank = 3
topic = "topic 1"

top_neighborhoods = get_top_neighborhoods(H, topic, max_rank=3)
print(f"The top {max_rank} neighborhoods for {topic}:", ', '.join(top_neighborhoods))
The top 3 neighborhoods for topic 1: Leather District, Chinatown, South Boston Waterfront
In [68]:
fig = map_topic(df, H, topic)
img_bytes = fig.to_image(format="png")
Image(img_bytes)
Out[68]:

Topic 2: About restaurants, bars, shops, etc.

In [69]:
max_rank = 3
topic = "topic 2"

top_neighborhoods = get_top_neighborhoods(H, topic, max_rank=3)
print(f"The top {max_rank} neighborhoods for {topic}:", ', '.join(top_neighborhoods))
The top 3 neighborhoods for topic 2: Mattapan, Brighton, Allston
In [70]:
fig = map_topic(df, H, topic)
img_bytes = fig.to_image(format="png")
Image(img_bytes)
Out[70]:

Topic 3: About quiet, safe and residential neighborhoods

In [71]:
max_rank = 3
topic = "topic 3"

top_neighborhoods = get_top_neighborhoods(H, topic, max_rank=3)
print(f"The top {max_rank} neighborhoods for {topic}:", ', '.join(top_neighborhoods))
The top 3 neighborhoods for topic 3: Longwood Medical Area, Roxbury, Hyde Park
In [72]:
fig = map_topic(df, H, topic)
img_bytes = fig.to_image(format="png")
Image(img_bytes)
Out[72]:

Topic 4: About museums, art or universities.

In [74]:
max_rank = 3
topic = "topic 4"

top_neighborhoods = get_top_neighborhoods(H, topic, max_rank=3)
print(f"The top {max_rank} neighborhoods for {topic}:", ', '.join(top_neighborhoods))
The top 3 neighborhoods for topic 4: Chinatown, Fenway, Charlestown
In [75]:
fig = map_topic(df, H, topic)
img_bytes = fig.to_image(format="png")
Image(img_bytes)
Out[75]:

Topic 5: About walking, historic streets, green space, rivers and ponds.

In [76]:
max_rank = 3
topic = "topic 5"

top_neighborhoods = get_top_neighborhoods(H, topic, max_rank=3)
print(f"The top {max_rank} neighborhoods for {topic}:", ', '.join(top_neighborhoods))
The top 3 neighborhoods for topic 5: Back Bay, Beacon Hill, Bay Village
In [77]:
fig = map_topic(df, H, topic)
img_bytes = fig.to_image(format="png")
Image(img_bytes)
Out[77]: